kvcoord: always start TxnCoordSender heartbeat loop #67215
Conversation
Force-pushed 780a7b1 to 151d782
Thanks for running these benchmarks. This is all very positive and indicates to me that this is the direction to move in. I am interested in exploring whether we can remove the HasTxnRecord check altogether, but either way, these benchmarks show that this is a viable path forward that avoids all of the client-side complexity that we discussed in #57042.
Out of everything here, I'm actually most surprised that the added goroutine per 1PC txn didn't have an impact on performance. What did you think of the proposal to avoid this in the common case with a time.AfterFunc call that's delayed by the first heartbeat interval? Timers are bound to a specific P in the Go runtime, so this should not create any new lock contention.
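For reference, a rough sketch of that proposal -- not code from this PR; h, ctx, and the helper names are placeholders for the heartbeater interceptor's actual wiring:

```go
// Hypothetical sketch: delay heartbeat-loop startup by one heartbeat
// interval so that 1PC txns which finish quickly never spawn a goroutine.
timer := time.AfterFunc(h.loopInterval, func() {
	h.startHeartbeatLoop(ctx) // placeholder for the real loop startup
})
h.mu.loopCancel = func() {
	// Stop returns false if the callback already fired, in which case the
	// running loop needs to be cancelled explicitly.
	if !timer.Stop() {
		h.cancelHeartbeatLoop() // placeholder
	}
}
```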
Reviewed 6 of 6 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @erikgrinaker)
pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater_test.go, line 241 at r1 (raw file):
```
etReq := ba.Requests[1].GetInner().(*roachpb.EndTxnRequest)
require.False(t, etReq.TxnHeartbeating)
```
Why remove these assertions if we're keeping the field?
pkg/kv/kvserver/replica_test.go, line 11269 at r1 (raw file):
```
et, etH := endTxnArgs(txn, true /* commit */)
et.InFlightWrites = inFlightWrites
et.TxnHeartbeating = true
```
Do we want to remove these?
pkg/kv/kvserver/replica_write.go, line 415 at r1 (raw file):
```
// The EndTxn checks whether the txn record can be created, but we're
// eliding the EndTxn. So, we'll do the check instead.
ok, minCommitTS, reason := r.CanCreateTxnRecord(ctx, ba.Txn.ID, ba.Txn.Key, ba.Txn.MinTimestamp)
```
I do still wonder what it would take to make a positive result from CanCreateTxnRecord imply a negative result from HasTxnRecord. We create transaction records in exactly two places, EndTxn and HeartbeatTxn. See the state machine diagram on CanCreateTxnRecord. So I think if we bumped the timestamp cache on HeartbeatTxn like we do for EndTxn in (*Replica).updateTimestampCache, then we may be able to avoid the HasTxnRecord check altogether.
I am interested in exploring whether we can remove the HasTxnRecord check altogether
I'll give this a try as a separate commit (or PR, if you'd prefer).
Out of everything here, I'm actually most surprised that the added goroutine per 1PC txn didn't have an impact on performance. What did you think of the proposal to avoid this in the common case with a time.AfterFunc call that's delayed by the first heartbeat interval? Timers are bound to a specific P in the Go runtime, so this should not create any new lock contention.
In the general case, a write transaction will involve at least a network hop and an fsync, and the latency of spawning a goroutine will be dwarfed by these. I suppose it could have an effect under very high concurrency or load, although I suspect even then it will be marginal compared to other costs -- in any case, we shouldn't optimize this until it's been demonstrated to be a problem (the nightly roachperf benchmarks will presumably reveal this if it's significant).
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @nvanbenschoten)
pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater_test.go, line 241 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Why remove these assertions if we're keeping the field?
Mostly because I started off by just removing the field outright, but you're right, we should retain assertions that we're setting them as necessary for 21.1 compatibility.
pkg/kv/kvserver/replica_test.go, line 11269 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Do we want to remove these?
Yeah, I think these are fine to remove since the code doesn't care about the field at all.
pkg/kv/kvserver/replica_write.go, line 415 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I do still wonder what it would take to make a positive result from CanCreateTxnRecord imply a negative result from HasTxnRecord. We create transaction records in exactly two places, EndTxn and HeartbeatTxn. See the state machine diagram on CanCreateTxnRecord. So I think if we bumped the timestamp cache on HeartbeatTxn like we do for EndTxn in (*Replica).updateTimestampCache, then we may be able to avoid the HasTxnRecord check altogether.
I'll have a look!
Force-pushed 40c45ac to f7ded0b
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @nvanbenschoten)
pkg/kv/kvserver/replica_write.go, line 415 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
I'll have a look!
I need to read up on the timestamp cache and the surrounding code, but I added a naïve proof of concept as a separate commit -- is something like this what you had in mind? It's unclear to me whether this is safe if the entry gets evicted from the cache, although I suppose in the best case we just leak a transaction record (assuming there is no possible scenario in which the txn could also have laid down an intent).
In any case, I ran a quick benchmark with this commit and it doesn't appear to have a significant effect, so I do wonder whether it's even worth it.
```
name                          old ops/sec  new ops/sec  delta
kv0/enc=false/nodes=1/cpu=32  23.2k ±12%   24.0k ± 4%   ~  (p=0.497 n=10+9)

name                          old p50      new p50      delta
kv0/enc=false/nodes=1/cpu=32  2.52 ± 3%    2.50 ± 0%    ~  (p=0.529 n=8+9)

name                          old p95      new p95      delta
kv0/enc=false/nodes=1/cpu=32  5.21 ±15%    5.00 ± 6%    ~  (p=0.513 n=10+9)

name                          old p99      new p99      delta
kv0/enc=false/nodes=1/cpu=32  7.78 ±21%    7.21 ± 6%    ~  (p=0.163 n=10+9)
```
I'm off next week, but let me know if you'd like me to merge the initial commit here and revisit the timestamp cache separately.
Out of everything here, I'm actually most surprised that the added goroutine per 1PC txn didn't have an impact on performance. What did you think of the proposal to avoid this in the common case with a time.AfterFunc call that's delayed by the first heartbeat interval? Timers are bound to a specific P in the Go runtime, so this should not create any new lock contention.
In the general case, a write transaction will involve at least a network hop and an fsync, and the latency of spawning a goroutine will be dwarfed by these. I suppose it could have an effect under very high concurrency or load, although I suspect even then it will be marginal compared to other costs -- in any case, we shouldn't optimize this until it's been demonstrated to be a problem (the nightly roachperf benchmarks will presumably reveal this if it's significant).
It's not about the latency of spawning a goroutine, I think, but about the impact on throughput. The kv0 benchmark you ran pushed the throughput to saturation?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @nvanbenschoten)
pkg/kv/kvserver/replica_tscache.go, line 124 at r3 (raw file):
```
}
case *roachpb.HeartbeatTxnRequest:
	key := transactionRecordMarker(start, txnID)
```
I think Nathan had in mind using TransactionTombstoneMarker here.
pkg/kv/kvserver/replica_tscache.go, line 510 at r3 (raw file):
```
_, recordTxnID := r.store.tsCache.GetMax(recordKey, nil /* end */)
if (recordTxnID != uuid.UUID{}) {
	return false, minCommitTS, 0
```
please use a new reason
pkg/kv/kvserver/replica_tscache.go, line 557 at r3 (raw file):
} }
spurious line?
pkg/kv/kvserver/replica_write.go, line 415 at r1 (raw file):
It's unclear to me whether this is safe if the entry gets evicted from the cache, although I suppose in the best case we just leak a transaction record
I think maybe there's some confusion about how the tscache works. Entries don't simply get evicted from it. Instead, the cache ratchets up the timestamp it tracks over whole key spans, thereby losing precision. The point is, once a key is tracked at ts1, the tscache will always return a ts >= ts1 for it.
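To illustrate the ratcheting behavior described here, a standalone toy model -- this is not the real interval-based tscache, just the semantics relevant to this thread:

```go
package main

import "fmt"

// entry records the last timestamp (and owning txn) seen for a key.
type entry struct {
	ts    int64
	txnID string
}

// toyTSCache mimics the property described above: reads never observe a
// timestamp below one that was previously recorded, but precision (here,
// the txn ID) is lost once the low water mark overtakes an entry.
type toyTSCache struct {
	lowWater int64
	entries  map[string]entry
}

func (c *toyTSCache) GetMax(key string) (int64, string) {
	if e, ok := c.entries[key]; ok && e.ts > c.lowWater {
		return e.ts, e.txnID
	}
	return c.lowWater, "" // still a valid lower bound, but anonymous
}

func main() {
	c := &toyTSCache{lowWater: 5, entries: map[string]entry{"k": {ts: 10, txnID: "txn1"}}}
	fmt.Println(c.GetMax("k")) // 10 txn1
	c.lowWater = 20            // ratcheted up; per-key precision lost
	fmt.Println(c.GetMax("k")) // 20 and an empty txn ID -- never below the earlier 10
}
```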
It's not about the latency of spawning a goroutine, I think, but about the impact on throughput. The kv0 benchmark you ran pushed the throughput to saturation?
I believe so, yeah -- it used the default workload concurrency of 48 -- but I can do some more benchmarks to make sure once we finalize the design here.
Since these goroutines basically immediately block on a timer, I believe the only way they can have a significant effect on throughput is by putting pressure on the Go scheduler -- this is what I alluded to in my comment, as that'll only happen under very high concurrency or load. I don't really see that being a problem until we're beyond tens of thousands of concurrent goroutines/transactions, and at that scale I suspect other factors will dominate throughput. But it's worth running some benchmarks at very high concurrency to see what effect it has, if any.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @erikgrinaker, and @nvanbenschoten)
pkg/kv/kvserver/replica_tscache.go, line 124 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I think Nathan had in mind using TransactionTombstoneMarker here.
We need to distinguish between a transaction that has already been finalized (i.e. ABORT_REASON_ALREADY_COMMITTED_OR_ROLLED_BACK_POSSIBLE_REPLAY) and one that has written a non-finalized record. In canAttempt1PCEvaluation, the first case should return a txn error while the second case shouldn't (it should instead take the 2PC branch). I don't think this is possible using transactionTombstoneMarker alone. We could change the error code to e.g. ABORT_REASON_RECORD_EXISTS, and then have callers check the txn status, take appropriate action, and convert the code to ABORT_REASON_ALREADY_COMMITTED_OR_ROLLED_BACK_POSSIBLE_REPLAY before propagating it to clients (to preserve API backwards compatibility). However, this feels convoluted and error-prone.
pkg/kv/kvserver/replica_tscache.go, line 510 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
please use a new reason
Yep, this is just a placeholder until I figure out how we want to solve this.
pkg/kv/kvserver/replica_tscache.go, line 557 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
spurious line?
Yup, fixed.
pkg/kv/kvserver/replica_write.go, line 415 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
It's unclear to me whether this is safe if the entry gets evicted from the cache, although I suppose in the best case we just leak a transaction record
I think maybe there's some confusion about how the tscache works. Entries don't simply get evicted from it. Instead, the cache ratchets up the timestamp it tracks over whole key spans, thereby losing precision. The point is, once a key is tracked at ts1, the tscache will always return a ts >= ts1 for it.
No, I get that. But once the low water mark passes this txn's entry timestamp, I believe GetMax() will no longer return the txn ID, and so we can't know whether that entry ever existed. I thought that would, for our purposes, be equivalent to it getting evicted from the cache -- but I guess we can at least check the txn timestamp against the low water mark and know whether it might have been evicted. In that case I guess we'll want to do something similar to tombstones, i.e. return an ABORT_REASON_TIMESTAMP_CACHE_REJECTED error to the client and retry the txn.
Force-pushed 187cf47 to 10b98c9
It's not about the latency of spawning a goroutine, I think, but about the impact on throughput. The kv0 benchmark you ran pushed the throughput to saturation?
I believe so, yeah -- it used the default workload concurrency of 48 -- but I can do some more benchmarks to make sure once we finalize the design here.
Since these goroutines basically immediately block on a timer, I believe the only way they can have a significant effect on throughput is by putting pressure on the Go scheduler -- this is what I alluded to in my comment, as that'll only happen under very high concurrency or load. I don't really see that being a problem until we're beyond tens of thousands of concurrent goroutines/transactions, and at that scale I suspect other factors will dominate throughput. But it's worth running some benchmarks at very high concurrency to see what effect it has, if any.
Like Nathan was saying, always starting hb goroutines should double the goroutine creation rate in some workloads. I would expect that to have some effect - for example because of allocating all the stacks. So, yeah, I would push harder on benchmarking (or just use the timer). A concurrency of 48 is not that high; depending on what machine you've tested against, I doubt that it saturates the CPU. Also make sure you're running kv0 with a uniform distribution across sufficiently many ranges.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @erikgrinaker, and @nvanbenschoten)
pkg/kv/kvserver/replica_tscache.go, line 124 at r3 (raw file):
In canAttempt1PCEvaluation, the first case should return a txn error while the second case shouldn't (it should instead take the 2PC branch)
canAttempt1PCEvaluation doesn't necessarily have to return an error in any case. It can just ignore the reason returned by CanCreateTxnRecord and return false, thus triggering the fallback to EndTxn evaluation (which reads the txn record from storage instead of relying on the ts cache to tell it anything). In my opinion, that'd be better than introducing a 3rd tscache key dealing with txn records. We would, however, need to update the comments on the transactionTombstoneMarker() to reflect the fact that it's only meaningful when there's no transaction record.
pkg/kv/kvserver/replica_write.go, line 415 at r1 (raw file):
I thought that would, for our purposes, be equivalent to it getting evicted from the cache
It's not equivalent. If the tombstone marker was simply never in the cache, then CanCreateTxnRecord might say true. If it ever was, then it will always say false (with either ABORT_REASON_ALREADY_COMMITTED_OR_ROLLED_BACK_POSSIBLE_REPLAY or ABORT_REASON_TIMESTAMP_CACHE_REJECTED).
So what I think we should do is take CanCreateTxnRecord() == false to mean we fall back to EndTxn evaluation.
Makes sense?
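A sketch of that fallback, reusing the CanCreateTxnRecord call quoted at the top of this thread -- the surrounding 1PC checks are elided and the return shape is simplified, not taken from the PR:

```go
// In canAttempt1PCEvaluation: if a record may already exist, simply decline
// the 1PC fast path instead of synthesizing an error; EndTxn evaluation will
// read the record from storage and handle it.
ok, _, _ := r.CanCreateTxnRecord(ctx, ba.Txn.ID, ba.Txn.Key, ba.Txn.MinTimestamp)
if !ok {
	return false // fall back to regular EndTxn evaluation
}
return true
```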
But it's worth running some benchmarks at very high concurrency to see what effect it has, if any.
A concurrency of 48 is not that high; depending on what machine you've tested against
Agreed. I imagine we'll start seeing a more pronounced effect, if one exists, on larger machines. Have you tried running kv0 on a 32 vCPU machine with 1x replication? I'd imagine the most comprehensive trial would be to test with three variations on that configuration: first with what you have here, second with a timer to defer goroutine creation, and third without a heartbeat loop at all. Also, we don't necessarily need to make this change in this PR, but I would encourage you to keep pulling on this thread. Otherwise, it's going to be a while until we return to this.
Reviewed 1 of 1 files at r2, 3 of 3 files at r4, 3 of 3 files at r5.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei)
pkg/kv/kvserver/replica_tscache.go, line 124 at r3 (raw file):
canAttempt1PCEvaluation doesn't necessarily have to return an error in any case. It can just ignore the reason returned by CanCreateTxnRecord and return false, thus triggering the fallback to EndTxn evaluation (which reads the txn record from storage instead of relying on the ts cache to tell it anything). In my opinion, that'd be better than introducing a 3rd tscache key dealing with txn records. We would, however, need to update the comments on the transactionTombstoneMarker() to reflect the fact that it's only meaningful when there's no transaction record.
Yes, I agree with both of the points here. Instead of introducing a new key, I think it would be better to fall back to EndTxn evaluation (return false from canAttempt1PCEvaluation) when CanCreateTxnRecord returns false.
Force-pushed 02bca19 to 489edcf
always starting hb goroutines should double the goroutine creation rate in some workloads. I would expect that to have some effect - for example because of allocating all the stacks. So, yeah, I would push harder on benchmarking (or just use the timer). A concurrency of 48 is not that high; depending on what machine you've tested against, I doubt that it saturates the CPU. Also make sure you're running kv0 with a uniform distribution across sufficiently many ranges.
Agreed. I imagine we'll start seeing a more pronounced effect, if one exists, on larger machines. Have you tried running kv0 on a 32 vCPU machine with 1x replication? I'd imagine the most comprehensive trial would be to test with three variations on that configuration: first with what you have here, second with a timer to defer goroutine creation, and third without a heartbeat loop at all.
You're right, deferring the heartbeat loop with time.AfterFunc() increases throughput by about 3% when running 4096 concurrent kv0 writers on a single 32-CPU node with 1000 ranges. I've run a bunch of benchmarks here, choice picks:
old=always goroutine + tscache, new=afterfunc + tscache
```
name                                     old ops/sec  new ops/sec  delta
kv0/enc=false/nodes=1/cpu=32/conc=4096   42.1k ± 1%   43.5k ± 1%   +3.32%  (p=0.000 n=9+10)

name                                     old p50      new p50      delta
kv0/enc=false/nodes=1/cpu=32/conc=4096   96.5 ± 0%    92.3 ± 0%    -4.35%  (p=0.000 n=8+8)

name                                     old p95      new p95      delta
kv0/enc=false/nodes=1/cpu=32/conc=4096   168 ± 0%     176 ± 0%     +5.01%  (p=0.000 n=10+10)

name                                     old p99      new p99      delta
kv0/enc=false/nodes=1/cpu=32/conc=4096   193 ± 0%     199 ± 3%     +3.05%  (p=0.005 n=9+10)
```

old=afterfunc + tscache, new=no heartbeat loop + tscache
```
name                                     old ops/sec  new ops/sec  delta
kv0/enc=false/nodes=1/cpu=32/conc=4096   43.5k ± 1%   43.6k ± 2%   ~       (p=0.274 n=10+8)

name                                     old p50      new p50      delta
kv0/enc=false/nodes=1/cpu=32/conc=4096   92.3 ± 0%    92.8 ± 4%    ~       (p=1.000 n=8+8)

name                                     old p95      new p95      delta
kv0/enc=false/nodes=1/cpu=32/conc=4096   176 ± 0%     176 ± 0%     ~       (all equal)

name                                     old p99      new p99      delta
kv0/enc=false/nodes=1/cpu=32/conc=4096   199 ± 3%     195 ± 3%     ~       (p=0.153 n=10+8)
```

old=afterfunc + HasTxnRecord, new=afterfunc + tscache
```
name                                     old ops/sec  new ops/sec  delta
kv0/enc=false/nodes=1/cpu=32/conc=4096   42.1k ± 1%   43.4k ± 1%   +2.90%  (p=0.000 n=10+9)

name                                     old p50      new p50      delta
kv0/enc=false/nodes=1/cpu=32/conc=4096   96.5 ± 0%    92.3 ± 0%    -4.35%  (p=0.000 n=10+8)

name                                     old p95      new p95      delta
kv0/enc=false/nodes=1/cpu=32/conc=4096   181 ± 3%     176 ± 0%     -2.75%  (p=0.006 n=10+9)

name                                     old p99      new p99      delta
kv0/enc=false/nodes=1/cpu=32/conc=4096   201 ± 0%     201 ± 0%     ~       (all equal)
```

old=master, new=afterfunc + tscache
```
name                                     old ops/sec  new ops/sec  delta
kv0/enc=false/nodes=1/cpu=32/conc=4096   43.8k ± 2%   43.4k ± 1%   -0.99%  (p=0.015 n=8+9)

name                                     old p50      new p50      delta
kv0/enc=false/nodes=1/cpu=32/conc=4096   92.3 ± 0%    92.3 ± 0%    ~       (all equal)

name                                     old p95      new p95      delta
kv0/enc=false/nodes=1/cpu=32/conc=4096   176 ± 0%     176 ± 0%     ~       (all equal)

name                                     old p99      new p99      delta
kv0/enc=false/nodes=1/cpu=32/conc=4096   193 ± 0%     201 ± 0%     +4.35%  (p=0.000 n=8+9)
```
I've updated the PR with time.AfterFunc and using the tscache tombstone markers to check records, should be good to review (barring any test failures). Thanks for pushing on this, I'm just wary of trading complexity for performance without measuring it first.
time.AfterFunc will end up spawning two goroutines (one for the AfterFunc and another for the nested RunAsyncTask), which may explain some of the increased p99 latencies and the slight throughput reduction from master. I'm guessing we want to keep using RunAsyncTask since we enforce its use, but let me know if you think the bare AfterFunc goroutine will do here.
Dismissed @andreimatei from a discussion.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @nvanbenschoten)
pkg/kv/kvserver/replica_tscache.go, line 124 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
canAttempt1PCEvaluation doesn't necessarily have to return an error in any case. It can just ignore the reason returned by CanCreateTxnRecord and return false, thus triggering the fallback to EndTxn evaluation (which reads the txn record from storage instead of relying on the ts cache to tell it anything). In my opinion, that'd be better than introducing a 3rd tscache key dealing with txn records. We would, however, need to update the comments on the transactionTombstoneMarker() to reflect the fact that it's only meaningful when there's no transaction record.
Yes, I agree with both of the points here. Instead of introducing a new key, I think it would be better to fall back to EndTxn evaluation (return false from canAttempt1PCEvaluation) when CanCreateTxnRecord returns false.
True, we can defer error generation to EndTxn -- I've implemented this. I'm not thrilled about the semantic complexity of tombstone markers also acting as markers for existing (non-deleted) records, but it gives the right result at least. We may want to consider renaming the markers, although it's not clear how to do this consistently without also changing ABORT_REASON_ALREADY_COMMITTED_OR_ROLLED_BACK_POSSIBLE_REPLAY.
pkg/kv/kvserver/replica_tscache.go, line 557 at r3 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Yup, fixed.
Done.
pkg/kv/kvserver/replica_write.go, line 415 at r1 (raw file):
It's not equivalent. If the tombstone marker was simply never in the cache, then CanCreateTxnRecord might say true. If it ever was, then it will always say false (with either ABORT_REASON_ALREADY_COMMITTED_OR_ROLLED_BACK_POSSIBLE_REPLAY or ABORT_REASON_TIMESTAMP_CACHE_REJECTED).
Yeah, but that's only because a transaction timestamp below the low water mark will never be allowed to write a transaction record, right? The bit I missed was that the caller can use the timestamps to know whether a "negative" (i.e. empty txn ID) is definitely true or possibly false. The existing behavior of CanCreateTxnRecord makes sense to me, and serves as a good illustration.
Force-pushed f598180 to 3483057
time.AfterFunc will end up spawning two goroutines (one for the AfterFunc and another for the nested RunAsyncTask), which may explain some of the increased p99 latencies and the slight throughput reduction from master.
This obviously isn't the case, since the p99 latencies are well below 1 second.
Anyway, I did another benchmark run with the latest PR code but always spawning a goroutine, to double-check that I hadn't gotten anything mixed up, which yielded:
```
name                                     old ops/sec  new ops/sec  delta
kv0/enc=false/nodes=1/cpu=32/conc=4096   43.2k ± 0%   42.0k ± 1%   -2.79%  (p=0.000 n=9+10)

name                                     old p50      new p50      delta
kv0/enc=false/nodes=1/cpu=32/conc=4096   93.6 ± 3%    98.6 ± 2%    +5.39%  (p=0.001 n=10+10)

name                                     old p95      new p95      delta
kv0/enc=false/nodes=1/cpu=32/conc=4096   176 ± 0%     168 ± 0%     -4.77%  (p=0.000 n=10+10)

name                                     old p99      new p99      delta
kv0/enc=false/nodes=1/cpu=32/conc=4096   201 ± 0%     193 ± 0%     -4.17%  (p=0.000 n=10+9)
```
The tail latencies here are back down to the master baseline, so it seems like they can be explained by time.AfterFunc -- I think this is a reasonable tradeoff. Throughput reduction is consistent with previous results.
As for the ~1% throughput reduction we saw before when comparing master with this PR, I suspect that's attributable to the additional processing on the hot path: the heartbeat loop preparation and tscache lookup. Since this only seems to have an effect under high concurrency when the node is overloaded, and not during nominal loads, I think that's probably OK; wdyt?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @erikgrinaker, and @nvanbenschoten)
Force-pushed 3483057 to 531864c
time.AfterFunc will end up spawning two goroutines (one for the AfterFunc and another for the nested RunAsyncTask), which may explain some of the increased p99 latencies and the slight throughput reduction from master. I'm guessing we want to keep using RunAsyncTask since we enforce its use, but let me know if you think the bare AfterFunc goroutine will do here.
If time.AfterFunc's callback runs in a new goroutine, then I don't see a strong reason to use RunAsyncTask. Do you? We could use RunTask to ensure that the goroutine is tracked in the Stopper without the extra hop.
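A sketch of what that would look like -- hbCtx, taskName, and h.heartbeatLoop as in the snippets quoted later in this review; error handling trimmed:

```go
// time.AfterFunc already runs the callback on its own goroutine, so RunTask
// (which runs the closure synchronously) registers it with the Stopper
// without adding a second goroutine hop.
timer := time.AfterFunc(h.loopInterval, func() {
	if err := h.stopper.RunTask(hbCtx, taskName, h.heartbeatLoop); err != nil {
		// The stopper is quiescing; the txn simply runs without a heartbeat loop.
	}
})
```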
Reviewed 1 of 4 files at r8, 2 of 4 files at r9.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @erikgrinaker, and @nvanbenschoten)
pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go, line 311 at r9 (raw file):
```
// If we fail to kick off the heartbeat loop, any changes to the txn
// record will be detected on commit or rollback, so we just log it.
log.VEventf(ctx, 1, "Failed to start txn heartbeat loop: %v", err)
```
nit: most log lines don't start with a capital letter.
pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go, line 316 at r9 (raw file):
```
h.mu.loopCancel = func() {
	if !timer.Stop() {
```
Any reason not to call hbCancel either way?
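For comparison, the "always cancel" variant that the reply below says was adopted -- hbCancel is assumed to be the heartbeat context's CancelFunc:

```go
h.mu.loopCancel = func() {
	timer.Stop() // best effort; a no-op if the callback has already run
	hbCancel()   // cancel unconditionally, whether or not the loop started
}
```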
pkg/kv/kvserver/replica_tscache.go, line 124 at r3 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
True, we can defer error generation to EndTxn -- I've implemented this. I'm not thrilled about the semantic complexity of tombstone markers also acting as markers for existing (non-deleted) records, but it gives the right result at least. We may want to consider renaming the markers, although it's not clear how to do this consistently without also changing ABORT_REASON_ALREADY_COMMITTED_OR_ROLLED_BACK_POSSIBLE_REPLAY.
I agree with your hesitation towards adding in a "special case" here. A lot of these "the marker is also overloaded" comments are quite unfortunate and we should try to avoid them.
But I don't necessarily think this is overloading the meaning. In its strictest form, a tombstone is written when removing an object to prevent it from being recreated. But in practice, it's often easier to create the tombstone at creation time instead. #64966 is another example of that. As it turns out, even EndTxn's handling here is an example of that, as we don't necessarily GC the transaction record during an EndTxn, we just mark it as finalized.
So I'd see how far you can get in updating these interactions without considering this a special case. If that means you need to change the name of ABORT_REASON_ALREADY_COMMITTED_OR_ROLLED_BACK_POSSIBLE_REPLAY, that seems fine.
pkg/kv/kvserver/replica_tscache.go, line 119 at r9 (raw file):
```
// transaction's MinTimestamp, which is consulted in
// CanCreateTxnRecord.
if br.Txn.Status.IsFinalized() {
```
Related to the discussion below, I think we'll also want to remove this condition and write the tombstone marker even if the EndTxn has created our txn record with a STAGING status.
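Roughly, the suggestion is to drop the finalized-status gate in this case of (*Replica).updateTimestampCache; the helper call below is schematic (addToTSCache and the exact arguments are assumptions, not taken from the PR):

```go
case *roachpb.EndTxnRequest:
	// Write the tombstone marker whenever EndTxn wrote the record, including
	// STAGING records, rather than only for finalized statuses.
	key := transactionTombstoneMarker(start, txnID)
	addToTSCache(key, nil, ts, txnID) // schematic; previously gated on br.Txn.Status.IsFinalized()
```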
pkg/kv/kvserver/replica_write.go, line 394 at r9 (raw file):
```
func (r *Replica) canAttempt1PCEvaluation(
	ctx context.Context, ba *roachpb.BatchRequest, latchSpans *spanset.SpanSet,
) (bool, *roachpb.Error) {
```
Will we ever return an error from this method anymore? If not, let's remove the error return value.
pkg/kv/kvserver/replica_write.go, line 414 at r9 (raw file):
2PC evaluation
I don't know that we would necessarily consider the non-1PC fast-path to be "2 phase-commit". Consider just saying "non-1PC"
pkg/roachpb/api.go, line 1217 at r9 (raw file):
```
// HeartbeatTxn updates the timestamp cache with transaction records,
// to avoid checking for them on disk when considering 1PC evaluation.
```
nit: this makes it sound like HeartbeatTxnRequest is the one checking for them. Consider rewording to "to avoid the need to check for them on disk when considering 1PC evaluation".
Force-pushed 531864c to 87d6456
If time.AfterFunc's callback runs in a new goroutine, then I don't see a strong reason to use RunAsyncTask. Do you? We could use RunTask to ensure that the goroutine is tracked in the Stopper without the extra hop.
Not at all, forgot about RunTask. Updated.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @erikgrinaker, and @nvanbenschoten)
pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go, line 316 at r9 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Any reason not to call hbCancel either way?
I could go either way, just figured we'd avoid a channel close which I'd expect to involve some sort of CPU synchronization instruction. Not likely to matter much either way, so updated it to always cancel.
pkg/kv/kvserver/replica_tscache.go, line 124 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I agree with your hesitation towards adding in a "special case" here. A lot of these "the marker is also overloaded" comments are quite unfortunate and we should try to avoid them.
But I don't necessarily think this is overloading the meaning. In its strictest form, a tombstone is written when removing an object to prevent it from being recreated. But in practice, it's often easier to create the tombstone at creation time instead. #64966 is another example of that. As it turns out, even EndTxn's handling here is an example of that, as we don't necessarily GC the transaction record during an EndTxn, we just mark it as finalized.
So I'd see how far you can get in updating these interactions without considering this a special case. If that means you need to change the name of ABORT_REASON_ALREADY_COMMITTED_OR_ROLLED_BACK_POSSIBLE_REPLAY, that seems fine.
I suppose that's true, if we squint a little. Changed the reason to ABORT_REASON_RECORD_ALREADY_WRITTEN_POSSIBLE_REPLAY. Not sure to what extent we try to maintain backwards compatibility; I suppose it's fine for the next major release as long as it's called out in the release notes.
pkg/kv/kvserver/replica_tscache.go, line 119 at r9 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Related to the discussion below, I think we'll also want to remove this condition and write the tombstone marker even if the EndTxn has created our txn record with a STAGING status.
Good catch, done.
pkg/kv/kvserver/replica_write.go, line 394 at r9 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Will we ever return an error from this method anymore? If not, let's remove the error return value.
Good catch, done.
mod the tracing questions. Nice job on this.
Reviewed 1 of 5 files at r10, 6 of 6 files at r11.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei and @erikgrinaker)
pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go, line 305 at r11 (raw file):
```
// We want the loop to run in a span linked to the current one, so we put
// our span in the context and fork it.
hbCtx = tracing.ContextWithSpan(hbCtx, tracing.SpanFromContext(ctx))
```
Is the use of ctx here racy? We're going to call h.mu.loopCancel on the way back through this interceptor if we're planning on aborting and finishing the span, so we'll call timer.Stop before any span in the context is finished, but that doesn't mean that this timer won't fire. So it's possible for this logic to outlive the tracing span in its context, right? Will that cause issues?
cc. @andreimatei I don't have a good mental model of how this works, other than that I know I've been burnt by it before. Any thoughts?
pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go, line 306 at r11 (raw file):
```
// our span in the context and fork it.
hbCtx = tracing.ContextWithSpan(hbCtx, tracing.SpanFromContext(ctx))
hbCtx, _ = tracing.ForkSpan(hbCtx, taskName)
```
Do we need to .Finish() this tracing span at some point?
Force-pushed 87d6456 to 8e2150d
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei and @nvanbenschoten)
pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go, line 305 at r11 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Is the use of ctx here racy? We're going to call h.mu.loopCancel on the way back through this interceptor if we're planning on aborting and finishing the span, so we'll call timer.Stop before any span in the context is finished, but that doesn't mean that this timer won't fire. So it's possible for this logic to outlive the tracing span in its context, right? Will that cause issues?
cc. @andreimatei I don't have a good mental model of how this works, other than that I know I've been burnt by it before. Any thoughts?
Contexts are generally expected to be immutable, so any modification to a context (and its contained values) should be done by creating a new context. I have no idea if the tracing stuff follows those conventions though.
I updated the code to set up the tracing span synchronously, out of caution. The span forking was previously done on the async goroutine, so I'll do a quick benchmark run and see how it affects performance.
pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go, line 306 at r11 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Do we need to .Finish() this tracing span at some point?
We do, nice catch.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei and @nvanbenschoten)
pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go, line 305 at r11 (raw file):
will do a quick benchmark run and see how it affects performance.
No effect.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @erikgrinaker, and @nvanbenschoten)
pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go, line 308 at r13 (raw file):
```
timer := time.AfterFunc(h.loopInterval, func() {
	if err := h.stopper.RunTask(hbCtx, taskName, h.heartbeatLoop); err != nil {
		// If we fail to kick off the heartbeat loop, any changes to the txn
```
I think few people would understand what this comment is telling them. I don't necessarily understand it myself. I'd remove it; I don't feel a need for any comment here.
I'd also get rid of the logging or, if I keep it, I'd put it lower than level 1 (at least the more usual 2). RunTask fails only if the stopper is quiescing; I don't think that deserves any logging.
pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go, line 310 at r13 (raw file):
```
// If we fail to kick off the heartbeat loop, any changes to the txn
// record will be detected on commit or rollback, so we just log it.
log.VEventf(ctx, 1, "failed to start txn heartbeat loop: %v", err)
```
nit: s/%v/%s. %v usually tells me that something unusual is going on - like for example that err can be nil
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @erikgrinaker, and @nvanbenschoten)
pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go, line 305 at r11 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
will do a quick benchmark run and see how it affects performance.
No effect.
was there a race in the hbCtx = tracing.ContextWithSpan(hbCtx, tracing.SpanFromContext(ctx)) line, regardless of whether timer.Stop was called or not? I don't see it.
I'd put the lines

```
hbCtx = tracing.ContextWithSpan(hbCtx, tracing.SpanFromContext(ctx))
hbCtx, span := tracing.ForkSpan(hbCtx, taskName)
```

back into the timer func. I'd also put the span.Stop() inside the timer func. That way, you don't open the door to a use-after-finish in the VEventf() when loopCancel is called. We've made our spans resilient to this kind of use-after-finish, but it's still not nice.
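A sketch of that arrangement -- the base context setup and error handling are elided, and span.Finish() stands in for the "finish it when done" step discussed earlier:

```go
timer := time.AfterFunc(h.loopInterval, func() {
	// Fork the span inside the callback so its lifetime stays confined to
	// the heartbeat goroutine, and finish it once the loop returns.
	hbCtx := tracing.ContextWithSpan(baseCtx, tracing.SpanFromContext(ctx)) // baseCtx: setup elided
	hbCtx, span := tracing.ForkSpan(hbCtx, taskName)
	defer span.Finish()
	_ = h.stopper.RunTask(hbCtx, taskName, h.heartbeatLoop)
})
```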
Force-pushed 8e2150d to 42ffe12
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei and @nvanbenschoten)
pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go, line 305 at r11 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
was there a race in the hbCtx = tracing.ContextWithSpan(hbCtx, tracing.SpanFromContext(ctx)) line, regardless of whether timer.Stop was called or not? I don't see it.

I'd put the lines

```
hbCtx = tracing.ContextWithSpan(hbCtx, tracing.SpanFromContext(ctx))
hbCtx, span := tracing.ForkSpan(hbCtx, taskName)
```

back into the timer func. I'd also put the span.Stop() inside the timer func. That way, you don't open the door to a use-after-finish in the VEventf() when loopCancel is called. We've made our spans resilient to this kind of use-after-finish, but it's still not nice.
Ok, done. It wasn't clear to me whether SpanFromContext() and ForkSpan() were concurrency-safe.
pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go, line 308 at r13 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I think few people would understand what this comment is telling them. I don't necessarily understand it myself. I'd remove it; I don't feel a need for any comment here.
I'd also get rid of the logging or, if I keep it, I'd put it lower than level 1 (at least the more usual 2). RunTask fails only if the stopper is quiescing; I don't think that deserves any logging.
Removed. The idea was to make it clear that it is safe to run the transaction without a heartbeat loop.
pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go, line 310 at r13 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
nit: s/%v/%s. %v usually tells me that something unusual is going on - like for example that err can be nil
Updated.
Force-pushed 42ffe12 to b36b4c0
Previously, potential 1PC transactions (containing an `EndTxn` in the same batch as writes) did not start the `TxnCoordSender` heartbeat loop, and thus never wrote a transaction record. This was a performance optimization to avoid checking for an existing record during write evaluation. However, it could cause problems under heavy contention where such transactions would abort each other as they did not heartbeat their transaction record and thus would become invalid once the txn timeout expired. This would apply even to transactions that turned out to not use 1PC due to spanning multiple ranges, and thus are more vulnerable to contention.

However, benchmarks showed that this optimization did not actually have a noticeable effect on performance (new always heartbeats):

```
name                           old ops/sec  new ops/sec  delta
kv0/enc=false/nodes=1/cpu=32    24.5k ± 8%   24.7k ± 3%    ~     (p=0.796 n=9+9)
kv0/enc=false/nodes=3/cpu=32    36.7k ± 4%   36.5k ± 4%    ~     (p=0.529 n=10+10)
kv95/enc=false/nodes=3/cpu=32    124k ± 7%    131k ± 2%  +5.56%  (p=0.002 n=10+8)
ycsb/A/nodes=3/cpu=32           27.7k ± 4%   28.1k ± 4%    ~     (p=0.315 n=9+10)

name                           old p50      new p50      delta
kv0/enc=false/nodes=1/cpu=32     2.50 ± 0%    2.50 ± 0%    ~     (all equal)
kv0/enc=false/nodes=3/cpu=32     4.70 ± 0%    4.70 ± 0%    ~     (all equal)
kv95/enc=false/nodes=3/cpu=32    1.00 ± 0%    1.00 ± 0%    ~     (all equal)
ycsb/A/nodes=3/cpu=32            2.44 ± 2%    2.45 ± 2%    ~     (p=1.000 n=10+10)

name                           old p95      new p95      delta
kv0/enc=false/nodes=1/cpu=32     4.79 ± 9%    4.70 ± 0%    ~     (p=0.350 n=9+7)
kv0/enc=false/nodes=3/cpu=32     10.7 ± 3%    10.8 ± 7%    ~     (p=1.000 n=10+10)
kv95/enc=false/nodes=3/cpu=32    5.02 ±10%    4.70 ± 0%  -6.37%  (p=0.003 n=10+8)
ycsb/A/nodes=3/cpu=32            9.10 ± 8%    9.20 ± 3%    ~     (p=0.720 n=10+10)

name                           old p99      new p99      delta
kv0/enc=false/nodes=1/cpu=32     6.90 ±10%    6.87 ± 6%    ~     (p=1.000 n=9+9)
kv0/enc=false/nodes=3/cpu=32     14.4 ± 9%    14.4 ± 6%    ~     (p=0.707 n=10+9)
kv95/enc=false/nodes=3/cpu=32    9.79 ± 7%    9.21 ± 3%  -5.90%  (p=0.015 n=10+8)
ycsb/A/nodes=3/cpu=32            78.0 ± 9%    75.5 ± 6%    ~     (p=0.305 n=10+10)
```

This patch therefore always starts the heartbeat loop for all transactions, but only after the heartbeat interval (1 second) passes. In the typical 1PC case, we therefore do not spawn a goroutine nor write a transaction record anyway.

The `EndTxn.TxnHeartbeating` field is retained for compatibility with v21.1 nodes, otherwise these nodes would never check for txn records when evaluating possible 1PC writes.

Release note (performance improvement): Improved concurrency control for heavily contended write queries outside of transactions that touch multiple ranges, reducing excessive aborts and retries.
Potential 1PC transactions must currently check for an existing transaction record on disk. This patch changes the timestamp cache tombstone markers to be added when the record is first written rather than when it's finalized, which allows using the timestamp cache for this check, avoiding a disk read.

This also changes the name `ABORT_REASON_ALREADY_COMMITTED_OR_ROLLED_BACK_POSSIBLE_REPLAY` to `ABORT_REASON_RECORD_ALREADY_WRITTEN_POSSIBLE_REPLAY`, to reflect the new tombstone semantics. The Protobuf identifier is retained for wire compatibility.

Release note (api change): The transaction abort error reason `ABORT_REASON_ALREADY_COMMITTED_OR_ROLLED_BACK_POSSIBLE_REPLAY` has been renamed to `ABORT_REASON_RECORD_ALREADY_WRITTEN_POSSIBLE_REPLAY`.
Force-pushed b36b4c0 to 1a9a35c
Since @andreimatei is out on vacation I won't wait for his approval here. His comments have been addressed. TFTRs!

bors r=nvanbenschoten,andreimatei
Build succeeded:
kvcoord: always start TxnCoordSender heartbeat loop
Previously, potential 1PC transactions (containing an `EndTxn` in the same batch as writes) did not start the `TxnCoordSender` heartbeat loop, and thus never wrote a transaction record. This was a performance optimization to avoid checking for an existing record during write evaluation. However, it could cause problems under heavy contention where such transactions would abort each other as they did not heartbeat their transaction record and thus would become invalid once the txn timeout expired. This would apply even to transactions that turned out to not use 1PC due to spanning multiple ranges, and thus are more vulnerable to contention.

However, benchmarks showed that this optimization did not actually have a noticeable effect on performance (new always heartbeats).

This patch therefore always starts the heartbeat loop for all transactions, but only after the heartbeat interval (1 second) passes. In the typical 1PC case, we therefore do not spawn a goroutine nor write a transaction record anyway.

The `EndTxn.TxnHeartbeating` field is retained for compatibility with v21.1 nodes, otherwise these nodes would never check for txn records when evaluating possible 1PC writes.

Resolves #57042.

Release note (performance improvement): Improved concurrency control for heavily contended write queries outside of transactions that touch multiple ranges, reducing excessive aborts and retries.
kvserver: use timestamp cache to mark and check txn records
Potential 1PC transactions must currently check for an existing transaction record on disk. This patch changes the timestamp cache tombstone markers to be added when the record is first written rather than when it's finalized, which allows using the timestamp cache for this check, avoiding a disk read.

This also changes the name `ABORT_REASON_ALREADY_COMMITTED_OR_ROLLED_BACK_POSSIBLE_REPLAY` to `ABORT_REASON_RECORD_ALREADY_WRITTEN_POSSIBLE_REPLAY`, to reflect the new tombstone semantics. The Protobuf identifier is retained for wire compatibility.

Release note (api change): The transaction abort error reason `ABORT_REASON_ALREADY_COMMITTED_OR_ROLLED_BACK_POSSIBLE_REPLAY` has been renamed to `ABORT_REASON_RECORD_ALREADY_WRITTEN_POSSIBLE_REPLAY`.